{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "parental-advance",
   "metadata": {},
   "source": [
    "# ISBN Numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "confused-macintosh",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "central-participation",
   "metadata": {},
   "source": [
    "The function `clean_isbn()` cleans a column containing ISBN strings, and standardizes them in a given format. The function `validate_isbn()` validates either a single ISBN strings, a column of ISBN strings or a DataFrame of ISBN strings, returning `True` if the value is valid, and `False` otherwise."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "solved-audience",
   "metadata": {},
   "source": [
    "Currently, an ISBN is composed by following components:\n",
    "\n",
    "* 3-digit (only in ISBN-13) Bookland Code\n",
    "* 1 to 5-digit Group Identifier (identifies country or language)\n",
    "* 1 to 7-digit Publisher Code\n",
    "* 1 to 8-digit Item Number (identifies the book)\n",
    "* a Check Digit\n",
    "\n",
    "Usually, the delimiters of ISBN is `-`.\n",
    "\n",
    "ISBN strings can be converted to the following formats via the `output_format` parameter:\n",
    "\n",
    "* `compact`: only number strings without any seperators, like \"9789024538270\"\n",
    "* `standard`: ISBN strings with proper delimiters in the proper places, like \"978-90-245-3827-0\"\n",
    "* `isbn13`: standard format of 13-digits ISBN string, like \"978-90-245-3827-0\"\n",
    "* `isbn10`: standard format of 10-digits ISBN string, like \"90-245-3827-0\"\n",
    "\n",
    "Invalid parsing is handled with the `errors` parameter:\n",
    "\n",
    "* `coerce` (default): invalid parsing will be set to NaN\n",
    "* `ignore`: invalid parsing will return the input\n",
    "* `raise`: invalid parsing will raise an exception\n",
    "\n",
    "The following sections demonstrate the functionality of `clean_isbn()` and `validate_isbn()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "mathematical-strip",
   "metadata": {},
   "source": [
    "### An example dataset containing ISBN strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "underlying-shooting",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"isbn\": [\n",
    "            \"978-9024538270\",\n",
    "            \"978-9024538271\",\n",
    "            \"1-85798-218-5\",\n",
    "            \"9780471117094\",\n",
    "            \"1857982185\",\n",
    "            \"hello\",\n",
    "            np.nan,\n",
    "            \"NULL\"\n",
    "        ], \n",
    "        \"address\": [\n",
    "            \"123 Pine Ave.\",\n",
    "            \"main st\",\n",
    "            \"1234 west main heights 57033\",\n",
    "            \"apt 1 789 s maple rd manhattan\",\n",
    "            \"robie house, 789 north main street\",\n",
    "            \"1111 S Figueroa St, Los Angeles, CA 90015\",\n",
    "            \"(staples center) 1111 S Figueroa St, Los Angeles\",\n",
    "            \"hello\",\n",
    "        ]\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "raising-plane",
   "metadata": {},
   "source": [
    "## 1. Default `clean_isbn`\n",
    "\n",
    "By default, `clean_isbn` will clean isbn strings and output them in the standard format with proper separators."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "accessible-walker",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_isbn\n",
    "clean_isbn(df, column = \"isbn\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "strategic-fifth",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_clean = clean_isbn(df, column = \"isbn\", inplace=True)\n",
    "df_clean"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "designing-space",
   "metadata": {},
   "source": [
    "## 2. Output formats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eastern-quilt",
   "metadata": {},
   "source": [
    "This section demonstrates the output parameter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "working-ranking",
   "metadata": {},
   "source": [
    "### `standard` (default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cloudy-lloyd",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, column = \"isbn\", inplace=True, output_format=\"standard\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "collectible-candle",
   "metadata": {},
   "source": [
    "### `compact`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "every-angle",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, column = \"isbn\", inplace=True, output_format=\"compact\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "mighty-technology",
   "metadata": {},
   "source": [
    "### `isbn13`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "individual-grant",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, column = \"isbn\", inplace=True, output_format=\"isbn13\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "advanced-nightlife",
   "metadata": {},
   "source": [
    "### `isbn10`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "turkish-phenomenon",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, column = \"isbn\", inplace=True, output_format=\"isbn10\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "labeled-think",
   "metadata": {},
   "source": [
    "## 3. `split` parameter\n",
    "\n",
    "The `split` parameter adds individual columns containing the cleaned 13-digits ISBN values to the given DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "signal-testimony",
   "metadata": {},
   "outputs": [],
   "source": [
    " clean_isbn(df, column=\"isbn\", split=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "technological-updating",
   "metadata": {},
   "source": [
    "## 4. `inplace` parameter\n",
    "\n",
    "This deletes the given column from the returned DataFrame. \n",
    "A new column containing cleaned ISBN strings is added with a title in the format `\"{original title}_clean\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "recent-franchise",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, column=\"isbn\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "covered-distributor",
   "metadata": {},
   "source": [
    "### `inplace` and `split`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "expressed-exhibition",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, column=\"isbn\", inplace=True, split=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "helpful-green",
   "metadata": {},
   "source": [
    "## 5. `errors` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "needed-dynamics",
   "metadata": {},
   "source": [
    "### `coerce` (default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ambient-spice",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, \"isbn\", errors=\"coerce\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "verbal-detroit",
   "metadata": {},
   "source": [
    "### `ignore`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aggressive-uncle",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_isbn(df, \"isbn\", errors=\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "later-closing",
   "metadata": {},
   "source": [
    "## 4. `validate_isbn()`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "historic-cholesterol",
   "metadata": {},
   "source": [
    "`validate_isbn()` returns `True` when the input is a valid phone number. Otherwise it returns `False`.\n",
    "\n",
    "The input of `validate_isbn()` can be a string, a Pandas DataSeries, a Dask DataSeries, a Pandas DataFrame and a dask DataFrame.\n",
    "\n",
    "When the input is a string, a Pandas DataSeries or a Dask DataSeries, user doesn't need to specify a column name to be validated. \n",
    "\n",
    "When the input is a Pandas DataFrame or a dask DataFrame, user can both specify or not specify a column name to be validated. If user specify the column name, `validate_isbn()` only returns the validation result for the specified column. If user doesn't specify the column name, `validate_isbn()` returns the validation result for the whole DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "commercial-press",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_isbn\n",
    "print(validate_isbn(\"978-9024538270\"))\n",
    "print(validate_isbn(\"978-9024538271\"))\n",
    "print(validate_isbn(\"1-85798-218-5\"))\n",
    "print(validate_isbn(\"9780471117094\"))\n",
    "print(validate_isbn(\"1857982185\"))\n",
    "print(validate_isbn(\"hello\"))\n",
    "print(validate_isbn(np.nan))\n",
    "print(validate_isbn(\"NULL\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "central-apollo",
   "metadata": {},
   "source": [
    "### DataFrame + Specify Column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "minus-block",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_isbn(df, column=\"isbn\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bacterial-warning",
   "metadata": {},
   "source": [
    "### Only DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "informal-employer",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_isbn(df)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
